Provably Mitigating Corruption, Overoptimization, and Verbosity Simultaneously in Offline and Online RLHF/DPO Alignment
Chen, Ziyi, Li, Junyi, Yu, Peiran, Huang, Heng
Reinforcement learning from human feedback (RLHF) and direct preference optimization (DPO) are important techniques for aligning large language models (LLMs) with human preferences. However, the quality of RLHF and DPO training is seriously compromised by \textit{\textbf{C}orrupted} preferences, reward \textit{\textbf{O}veroptimization}, and bias towards \textit{\textbf{V}erbosity}. To our knowledge, most existing works tackle only one of these important issues, and the few works that address more than one require substantial computation to estimate multiple reward models and lack theoretical guarantees on generalization ability. In this work, we propose the RLHF-\textbf{COV} and DPO-\textbf{COV} algorithms, which can simultaneously mitigate all three issues, in both offline and online settings. This ability is theoretically demonstrated by obtaining length-regularized generalization error rates for our DPO-COV algorithms trained on corrupted data, which match the best-known rates for the simpler cases with clean data and without length regularization. Moreover, our DPO-COV algorithm is simple to implement without reward estimation, and is proved to be equivalent to our RLHF-COV algorithm, which directly implies the equivalence between the vanilla RLHF and DPO algorithms. Experiments demonstrate the effectiveness of our DPO-COV algorithms under both offline and online settings.
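To make the length-regularization idea concrete, below is a minimal sketch of a length-regularized DPO-style pairwise loss; the function name, the hyperparameters `beta` and `alpha`, and the exact form of the length term are illustrative assumptions, not the paper's definition of the DPO-COV objective.

```python
import torch.nn.functional as F

def length_regularized_dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l,
                                len_w, len_l, beta=0.1, alpha=0.01):
    """Illustrative length-regularized DPO-style loss (a sketch, not the
    paper's exact DPO-COV objective).

    logp_w / logp_l: policy log-probabilities of the chosen / rejected
    responses; ref_logp_*: reference-model log-probabilities of the same
    responses; len_w / len_l: response lengths in tokens. alpha is a
    hypothetical coefficient that penalizes the length gap to discourage
    verbosity.
    """
    # Standard DPO margin: scaled difference of policy-vs-reference log-ratios.
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # Length regularization: subtract a penalty proportional to the length gap.
    margin = margin - alpha * (len_w - len_l)
    return -F.logsigmoid(margin).mean()
```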
Achieve Performatively Optimal Policy for Performative Reinforcement Learning
Performative reinforcement learning is an emerging dynamic decision-making framework that extends reinforcement learning to the common applications where the agent's policy can change the environmental dynamics. Existing works on performative reinforcement learning aim only at a performatively stable (PS) policy that maximizes an approximate value function. However, there is a provably positive constant gap between the PS policy and the desired performatively optimal (PO) policy that maximizes the original value function. In contrast, this work proposes a zeroth-order Frank-Wolfe (0-FW) algorithm, which uses a zeroth-order approximation of the performative policy gradient within the Frank-Wolfe framework, and obtains \textbf{the first polynomial-time convergence to the desired PO} policy under the standard regularizer dominance condition. For the convergence analysis, we prove two important properties of the nonconvex value function. First, when the policy regularizer dominates the environmental shift, the value function satisfies a certain gradient dominance property, so that any stationary point (not PS point) of the value function is a desired PO policy. Second, although the value function has an unbounded gradient, we prove that all sufficiently stationary points lie in a convex and compact policy subspace $\Pi_{\Delta}$, where the policy value has a constant lower bound $\Delta>0$ and thus the gradient becomes bounded and Lipschitz continuous. Experimental results also demonstrate that our 0-FW algorithm is more effective than existing algorithms at finding the desired PO policy.
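As an illustration of the two ingredients named in the abstract, the sketch below combines a zeroth-order gradient estimate with a Frank-Wolfe step on a simplex-constrained policy vector; the flattened per-action policy, smoothing radius `mu`, step size `gamma`, and direction count are simplifying assumptions rather than the paper's exact 0-FW algorithm.

```python
import numpy as np

def zeroth_order_fw_step(policy, value_fn, mu=0.05, gamma=0.1,
                         num_dirs=20, rng=None):
    """One illustrative zeroth-order Frank-Wolfe step (a sketch of the
    0-FW idea, not the paper's exact algorithm).

    policy: probability vector over actions (real policies are per-state);
    value_fn(policy): the performative policy value, accessed only through
    function evaluations, so no analytic gradient of the policy-dependent
    environment is needed.
    """
    rng = rng or np.random.default_rng()
    # Zeroth-order gradient estimate from random finite differences.
    grad_est = np.zeros_like(policy)
    for _ in range(num_dirs):
        u = rng.standard_normal(policy.shape)
        grad_est += (value_fn(policy + mu * u) - value_fn(policy)) / mu * u
    grad_est /= num_dirs
    # Frank-Wolfe linear oracle over the probability simplex: a one-hot vertex.
    s = np.zeros_like(policy)
    s[np.argmax(grad_est)] = 1.0
    # Convex combination keeps the iterate inside the simplex.
    return (1.0 - gamma) * policy + gamma * s
```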
Trade-off in Estimating the Number of Byzantine Clients in Federated Learning
Chen, Ziyi, Zhang, Su, Huang, Heng
Federated learning has attracted increasing attention in recent large-scale optimization and machine learning research and applications, but it is vulnerable to Byzantine clients that can send arbitrary erroneous signals. Robust aggregators are commonly used to resist Byzantine clients. This usually requires estimating the unknown number $f$ of Byzantine clients and accordingly selecting an aggregator with the proper degree of robustness (i.e., the maximum number $\hat{f}$ of Byzantine clients the aggregator can tolerate). Such an estimation has an important effect on performance, which to our knowledge has not been systematically studied. This work fills this gap by theoretically analyzing the worst-case error of aggregators, as well as that of the induced federated learning algorithm, for all cases of $\hat{f}$ and $f$. Specifically, we show that underestimation ($\hat{f} < f$) ...
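To illustrate what a "degree of robustness" means for a concrete aggregator, here is a standard coordinate-wise trimmed mean whose robustness is set by the estimate $\hat{f}$; this is a common robust aggregator used for illustration, not necessarily the one analyzed in the paper.

```python
import numpy as np

def trimmed_mean(updates, f_hat):
    """Coordinate-wise trimmed mean with robustness degree f_hat, the
    assumed number of Byzantine clients (illustrative; the paper's
    analysis covers robust aggregators more generally).

    updates: (n_clients, dim) array of client updates. Per coordinate,
    the f_hat largest and f_hat smallest values are discarded before
    averaging, so up to f_hat Byzantine values cannot dominate the mean.
    """
    n = updates.shape[0]
    assert 2 * f_hat < n, "trimming requires n > 2 * f_hat"
    sorted_updates = np.sort(updates, axis=0)
    return sorted_updates[f_hat:n - f_hat].mean(axis=0)
```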
A Cubic Regularization Approach for Finding Local Minimax Points in Nonconvex Minimax Optimization
Chen, Ziyi, Hu, Zhengyang, Li, Qunwei, Wang, Zhe, Zhou, Yi
Gradient descent-ascent (GDA) is a widely used algorithm for minimax optimization. However, GDA has been proved to converge only to stationary points in nonconvex minimax optimization, which are suboptimal compared with local minimax points. In this work, we develop cubic regularization (CR) type algorithms that globally converge to local minimax points in nonconvex-strongly-concave minimax optimization. We first show that local minimax points are equivalent to second-order stationary points of a certain envelope function. Then, inspired by the classic cubic regularization algorithm, we propose an algorithm named Cubic-LocalMinimax for finding local minimax points, and provide a comprehensive convergence analysis by leveraging its intrinsic potential function. Specifically, we establish the global convergence of Cubic-LocalMinimax to a local minimax point at a sublinear convergence rate and characterize its iteration complexity. We also propose a GDA-based solver for solving the cubic subproblem involved in Cubic-LocalMinimax up to a certain pre-defined accuracy, and analyze the overall gradient and Hessian-vector product computation complexities of such an inexact Cubic-LocalMinimax algorithm. Moreover, we propose a stochastic variant of Cubic-LocalMinimax for large-scale minimax optimization and characterize its sample complexity under stochastic sub-sampling. Experimental results demonstrate faster convergence of our stochastic Cubic-LocalMinimax compared with some existing algorithms.
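For reference, the cubic subproblem mentioned in the abstract has the standard cubic-regularized Newton form $\min_s \; g^\top s + \frac{1}{2} s^\top H s + \frac{M}{6}\|s\|^3$; the sketch below solves it with plain gradient descent as an illustrative stand-in for the paper's GDA-based inexact solver (`M`, `lr`, and `iters` are assumed settings).

```python
import numpy as np

def solve_cubic_subproblem(grad, hess, M=1.0, lr=0.01, iters=200):
    """Approximately solve  min_s  g^T s + 0.5 s^T H s + (M/6) ||s||^3
    by gradient descent (an illustrative stand-in for the paper's
    GDA-based inexact solver).

    The gradient of the cubic term (M/6)||s||^3 is (M/2)||s|| s.
    """
    s = np.zeros_like(grad)
    for _ in range(iters):
        sub_grad = grad + hess @ s + 0.5 * M * np.linalg.norm(s) * s
        s -= lr * sub_grad
    return s
```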
Momentum with Variance Reduction for Nonconvex Composition Optimization
Composition optimization is widely applied in nonconvex machine learning. Various advanced stochastic algorithms that adopt momentum and variance reduction techniques have been developed for composition optimization. However, these algorithms do not fully exploit both techniques to accelerate convergence, and they lack convergence guarantees in nonconvex optimization. This paper complements the existing literature by developing various momentum schemes with SPIDER-based variance reduction for nonconvex composition optimization. In particular, our momentum design requires fewer proximal mapping evaluations per iteration than the existing Katyusha momentum. Furthermore, our algorithm achieves near-optimal sample complexity in both nonconvex finite-sum and online composition optimization, and achieves a linear convergence rate under the gradient dominance condition. Numerical experiments demonstrate that our algorithm converges significantly faster than existing algorithms in nonconvex composition optimization.
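As a concrete picture of SPIDER-based variance reduction in the composition setting $\min_x f(g(x))$, the sketch below tracks SPIDER-style estimates of the inner mapping and its Jacobian and chains them; the interfaces (`g_batch`, `J_batch`, `grad_f`) and the omission of the momentum and proximal steps are simplifying assumptions, not the paper's exact recursion.

```python
import numpy as np

def composition_spider_step(x, x_prev, g_est, J_est, lr,
                            g_batch, J_batch, grad_f):
    """One illustrative SPIDER-style step for min_x f(g(x)) (a sketch
    under assumed interfaces, not the paper's exact algorithm).

    g_batch(x) / J_batch(x): minibatch estimates of the inner mapping
    g(x) and its Jacobian, evaluated on the SAME samples at x and x_prev
    so the correction terms reduce variance; grad_f: gradient of the
    outer function f.
    """
    # SPIDER recursions for the inner mapping and its Jacobian.
    g_est = g_batch(x) - g_batch(x_prev) + g_est
    J_est = J_batch(x) - J_batch(x_prev) + J_est
    # Chain rule with the tracked estimates: grad F(x) ~ J(x)^T grad_f(g(x)).
    v = J_est.T @ grad_f(g_est)
    return x - lr * v, g_est, J_est
```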